Abstract

The predictive performance of any inferential model is critical to its practical success,
but quantifying predictive performance is a subtle statistical problem. In this
paper I show how the natural structure of any inferential problem defines a canonical
measure of relative predictive performance and then demonstrate how approximations 
of this measure yield many of the model comparison techniques popular in statistics 
and machine learning. 

Because any inferential method is built upon assumptions, one of the most important 
aspects of any statistical analysis is assessing the validity of the underlying assumptions. 
Although there are few model assessment approaches that claim to validate a model in 
isolation, there is a rich history of comparative techniques in the statistics literature, from 
visual residual analyses to scoring rules and predictive cross validation to the myriad of 
information criteria. The practical challenge in applying these methods, however, is in 
determining the ultimate accuracy of their assessments and hence which might be most 
appropriate to a given problem. 

In this paper I demonstrate that any inferential system admits a canonical measure of 
comparative predictive performance, here termed a relative predictive performance score. 
Moreover, I show how many of the model comparison techniques in practice today arise
as approximations to these canonical scores. This foundational perspective provides a
common context for understanding the advantages and disadvantages of each technique 
both in theory and in practice. 

After reviewing the basic assumptions common to most inferential techniques, in particular
the assumptions of frequentist and Bayesian inference, I demonstrate first how relative
predictive performance scores arise naturally from these assumptions and then how various
approximations of these scores yield existing model comparison techniques. In the
latter I emphasize how this foundational construction immediately identifies the practical 
consequences of each approximation. 
1 Foundational Assumptions of Inference


The most fundamental assumption underlying any attempt at inference is the existence of
some latent data generating process, $\pi$, responsible for generating measurements. Formally
we assume that measurements are drawn from some measurable space, $(Y, \mathcal{Y})$, and that the
data generating process itself can be modeled mathematically by a single, time-invariant
probability measure,

$$\pi : \mathcal{Y} \to [0, 1].$$

In order to make this construction as general as possible I impose no additional structure, 
such as a particular topology or metric, on Y, nor make any philosophical interpretation 
of this latent data generating process, in particular its interpretation as an ontological 
truth or just an epistemological impression. Consequently the following results will hold 
regardless of any deeper meaning of the measurement process itself. 

Inference is then a formal effort to learn the latent data generating process given a
measurement, $y \in Y$, by identifying $\pi$ from the space of all probability measures, $P$, on
the measurement space. Unfortunately, exploring the entirety of $P$ for any problem is far
too unwieldy, and in order to construct practical inferential methods we first have to limit
ourselves to a more manageable space of data generating processes.

An inferential model is the selection of a distinguished subset of data generating processes,
$X \subset P$; in the spirit of Dennis Lindley I refer to such subsets as small worlds [1]. As
with the latent data generating process, I am careful not to assign any particular meaning
to the small world - it can be a phenomenological model motivated by mathematical and
practical convenience, a theoretical model motivated by a specific scientific hypothesis, or,
as is most common in practice, a delicate combination of the two. In this paper I denote
the measure corresponding to a given element of the small world, $x \in X$, as

$$\pi_x : \mathcal{Y} \to [0, 1].$$

One assumption that I have explicitly not made is that the chosen small world need
contain the latent data generating process (Figure 1). In particular, assuming that the
small world rarely, if ever, contains the latent data generation process formalizes the Boxian
philosophy that “all models are wrong but some are useful” [2]. Although developments
in computation and theory, such as statistical nonparametrics, have enabled the construction
of increasingly complex models and less-small worlds, the intricacy of any realistic
measurement should continue to inspire skepticism in the sufficiency of any small world.

From this perspective, the ultimate utility of any inferential procedure is not in whether
it can find the latent data generating process exactly but rather in how well it can approximate
the latent data generating process. In particular, the fidelity of a procedure is often
judged based on its predictive performance. Exactly how we define predictive performance
depends intimately on how inference is implemented, which itself depends on the fundamental
interpretation of probability.

Figure 1: Inference requires the selection of a distinguished subset of data generating
processes that (a) may or (b) may not contain the latent data generating process, $\pi$. The
Boxian philosophy asserts that the former is impossible in practical problems but that we
may still hope that some data generating process in $X$ will well-approximate $\pi$.

2 Inference in the Small World 

We have already assumed that probability theory adequately describes the measurement 
process, but that does not have to be the only application of probability theory. While 
frequentist inference limits probabilities to the data, Bayesian inference also endows the 
small world with a probabilistic interpretation. 

In the following sections I review the measure-theoretic construction of frequentist and
Bayesian inference. The formality is a necessary evil in order to identify the canonical measures
of predictive performance that do not rely on the additional structure of a particular
measurement space.

2.1 Frequentist Inference 

In frequentist inference probabilities are defined strictly as frequencies of repeatable
events, namely measurements. Consequently probability theory applies to only the latent
data generating process, $\pi$, and the data generating processes in the small world, $\{\pi_x\}$.
Considered as a family of probability distribution functions on the measurement space, the
small world,

$$\pi_x : X \times \mathcal{Y} \to [0, 1],$$

is otherwise known as a likelihood function.

Given this rigid definition of probability, the only way we can construct a complete 
predictive distribution is by selecting a single element of the small world and utilizing the 
corresponding data generating process. One of the most prolific approaches to selecting 
such an element is with the use of estimators, functions of the data that identify some 
aspect of the latent data generating process. Formally, estimators are defined as maps
from the measurement space to some auxiliary space, $e : Y \to Z$.
Given a set of estimators, $E$, we can quantify how well a given estimator identifies the
latent data generating process with a loss function,

$$L : Y \times E \to \mathbb{R}.$$

The corresponding risk of an estimator is defined as

$$R : X \times E \to \mathbb{R}, \qquad (x, e) \mapsto \int_Y \pi_x(dy) \, L(y, e),$$

and minimax estimators are defined by the optimality criterion,

$$e_M = \operatorname*{argmin}_{e \in E} \max_{x \in X} R(x, e).$$

With a map $f : Z \to X$ we can then select a single element of the small world and define
the subsequent predictive distribution for new data by

$$\pi_{Y|y} = \pi_{f(e_M(y))}.$$

For example, consider the circumstance where the small world contains the latent data
generation process,

$$\pi = x_\pi \in X,$$

and take any map $g : X \to Z$ with a well-defined inverse, $g^{-1} : g(X) \to X$. If $Z$ is a metric
space then a natural loss function is given by the distance function,

$$L(y, e) = D(g(x_\pi), e(y));$$

the resulting minimax estimator, $e_M$, approximates the true value of the function, $g(x_\pi)$,
which then identifies a unique element of the small world,

$$\hat{x} = g^{-1} \circ e_M : Y \to X, \qquad y \mapsto g^{-1}(e_M(y)).$$

Maximum likelihood estimators avoid the need for a loss function by using the likelihood
itself. Given any reference measure, $\lambda$, with respect to which every element of the small
world is absolutely continuous, the maximum likelihood estimator is defined as

$$\hat{x}_{\mathrm{MLE}} : Y \to X, \qquad y \mapsto \operatorname*{argmax}_{x \in X} \frac{d\pi_x}{d\lambda}(y).$$

Provided that the maximum is unique, $\hat{x}_{\mathrm{MLE}}$ identifies a unique element of the small world
and hence an unambiguous predictive distribution.
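To make the construction concrete, the following is a minimal sketch of a maximum likelihood predictive distribution. It assumes, purely for illustration, a small world of one-dimensional Gaussians with a Lebesgue reference measure, where the maximizer has a closed form; the function names and the data values are hypothetical and not taken from the paper.

```python
import math

def gaussian_mle(ys):
    """Closed-form maximum likelihood estimate (mu, sigma) for a Gaussian small world."""
    n = len(ys)
    mu = sum(ys) / n
    var = sum((y - mu) ** 2 for y in ys) / n  # the MLE divides by n, not n - 1
    return mu, math.sqrt(var)

def gaussian_log_density(y, mu, sigma):
    """Log density of N(mu, sigma^2) with respect to Lebesgue measure."""
    return -0.5 * math.log(2.0 * math.pi * sigma**2) - 0.5 * ((y - mu) / sigma) ** 2

# Hypothetical measurement; the selected element of the small world
# defines the predictive density for new data.
ys = [0.8, 1.3, 0.4, 1.1, 0.9]
mu_hat, sigma_hat = gaussian_mle(ys)
log_predictive_at_one = gaussian_log_density(1.0, mu_hat, sigma_hat)
```

Here the pair `(mu_hat, sigma_hat)` plays the role of the selected element of the small world, and `gaussian_log_density` evaluated at it is the log predictive density.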

The practical utility of a frequentist estimator is a subtle issue. When the small world 
does not contain the latent data generating process, for example, any predictive distribution 
derived from an estimator will never be able to recover the latent data generating process 
exactly. Moreover, even if the small world does contain the true data generating process 
there is no guarantee that an estimator evaluated at a given measurement will identify it. 
A given estimator may be unidentified and unable to select a single element of the small 
world at all, or it may simply be inaccurate or imprecise. In any case we must be skeptical 
of how well $\pi_{Y|y}$ approximates the latent data generating process.


2.2 Bayesian Inference 

Bayesian inference considers a more general interpretation of probability that encompasses
not just frequencies but any self-consistent system of uncertainty. Consequently we
can assign probability distributions to not only the measurement space but also the small
world itself given the choice of a $\sigma$-algebra, $\mathcal{X}$, on $X$.

From this perspective, the small world now defines a regular conditional probability
distribution,

$$\pi_{Y|X} : \mathcal{Y} \times X \to [0, 1],$$

with respect to the canonical projection operator,

$$\varpi_X : Y \times X \to X,$$

on the product space of measurements and the small world, $Y \times X$. The difference between
this regular conditional probability distribution and the frequentist likelihood function is
largely one of interpretation and, following convention, I will refer to both objects as
likelihoods.

Inference proceeds with the introduction of a prior distribution over the small world,

$$\pi_X : \mathcal{X} \to [0, 1],$$

that encodes all information about the latent data generating process within the context
of the small world before the current measurement is made. Together with the likelihood,
the prior distribution defines a joint distribution on the product space of measurements
and the small world, $\pi_{Y \times X}$, and information about the small world given a measurement
is encoded in the regular conditional probability distribution,

$$\pi_{X|Y} : \mathcal{X} \times Y \to [0, 1],$$

defined with respect to the second canonical projection operator,

$$\varpi_Y : Y \times X \to Y.$$
Conditioning $\pi_{X|Y}$ on a given measurement, $y$, gives the posterior distribution over the
small world,

$$\pi_{X|y} = \frac{d\pi_{Y|X}}{d(\varpi_Y)_* \pi_{Y \times X}}(y) \, \pi_X.$$

When $Y \sim X \sim \mathbb{R}^n$, the posterior density with respect to a reference Lebesgue measure is
given by the celebrated Bayes’ Rule,

$$\pi_{X|y}(x \mid y) = \frac{\pi_{Y|X}(y \mid x) \, \pi_X(x)}{\int_X dx \, \pi_{Y|X}(y \mid x) \, \pi_X(x)}.$$


Given the more general application of probability in Bayesian inference we can construct 
a predictive distribution not only by selecting a single element of the small world but also by 
averaging the elements of the small world with respect to a given probability distribution. 
The prior predictive distribution, for example, is given by weighting each element of the
small world according to the prior distribution,

$$\pi_Y(dy) = \int_X \pi_X(dx) \, \pi_{Y|X}(dy \mid x).$$

Similarly, the posterior predictive distribution is given by weighting each element of the
small world according to the posterior distribution,

$$\pi_{Y|y}(dy) = \int_X \pi_{X|y}(dx \mid y) \, \pi_{Y|X}(dy \mid x).$$

Because it learns from the measured data, the posterior predictive distribution should be 
a better approximation of the latent data generating process provided that the modeling 
assumptions, such as the choice of the small world and the prior distribution, are compatible 
with the true data generating process. 
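As an illustration of these two averages, the sketch below uses a conjugate model chosen only because both predictive distributions are available in closed form: an unknown Gaussian mean with a Gaussian prior and known noise scale. The model, the function names, and the numbers are assumptions made for this example and do not come from the paper.

```python
import math

def normal_posterior(ys, prior_mean, prior_sd, noise_sd):
    """Posterior (mean, sd) of an unknown Gaussian mean with known noise_sd."""
    prior_precision = 1.0 / prior_sd**2
    data_precision = len(ys) / noise_sd**2
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean + sum(ys) / noise_sd**2)
    return post_mean, math.sqrt(post_var)

def predictive(mean, sd, noise_sd):
    """Average N(x, noise_sd^2) over x ~ N(mean, sd^2); the result is Gaussian."""
    return mean, math.sqrt(sd**2 + noise_sd**2)

ys = [1.0, 2.0, 3.0]  # hypothetical measurement
prior_pred = predictive(0.0, 1.0, 1.0)
post_mean, post_sd = normal_posterior(ys, 0.0, 1.0, 1.0)
post_pred = predictive(post_mean, post_sd, 1.0)
```

In this toy setting the posterior predictive distribution is both shifted toward the data and narrower than the prior predictive distribution, which is the sense in which it learns from the measurement.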

As in frequentist inference, the performance of either predictive distribution depends 
critically on the assumptions in their construction. Some means of comparing the chosen 
predictive distribution to the latent data generating process is vital for validating the 
modeling assumptions and ensuring inferences that perform well in practice. 


3 Validating Inference 

Although the frequentist and Bayesian approaches have different means of inferring predictive
distributions, they can both succumb to the same pathologies that jeopardize predictive
performance.

The most obvious pathology is model misfit where the true data generating process is
not contained within the small world, $\pi \notin X$, and any inferential method will be able to
approximate the exact predictive distribution only so well (Figure 2). Even if the small

Figure 2: Because (a) frequentist estimators and (b) Bayesian prior and posterior distributions
are limited to the small world, neither approach will be able to construct a predictive
distribution capable of exactly modeling the latent data generating process, $\pi$, when it is
not an element of the small world.



Figure 3: Even when the small world does contain the latent data generating process, $\pi$,
inferences are not guaranteed to capture it. Here (a) a frequentist estimator evaluated at
a given measurement strays from $\pi$ while (b) a Bayesian prior or posterior distribution
concentrates away from $\pi$. In either case the predictive distributions will be biased away
from the latent data generating process.


world contains the latent data generating process, however, inferences still may not be able
to find it because they overfit to irrelevant structure in the measurement, such as purely
stochastic noise (Figure 3). In practice these two pathologies are somewhat antagonistic -
making a model more complex in order to reduce misfit often renders it more vulnerable
to noise and hence subject to overfitting.

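Consider, for example, the measurement space $Y = (\mathbb{R} \times \mathbb{R})^{12}$ with two data generating
processes: a Gaussian distribution centered on a quartic polynomial,

$$y_{1,n} \sim U(-1, 1), \qquad y_{2,n} \mid y_{1,n} \sim \mathcal{N}\!\left( \sum_{k=0}^{4} c_k \, y_{1,n}^{k}, \, \sigma^2 \right), \qquad n = 1, \ldots, 12, \qquad (1)$$

and a Gaussian distribution centered on a constant,

$$y_{1,n} \sim U(-1, 1), \qquad y_{2,n} \sim \mathcal{N}(c_0, \sigma^2), \qquad n = 1, \ldots, 12. \qquad (2)$$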
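The two data generating processes can be simulated with a few lines of code. The sketch below is only illustrative: the text does not state the coefficients or the noise scale, so the values passed in are placeholders, and the function names are hypothetical.

```python
import random

def simulate_polynomial(coeffs, sigma, n=12, rng=None):
    """Draw n pairs (y1, y2) with y1 ~ U(-1, 1) and y2 ~ N(sum_k c_k * y1^k, sigma^2)."""
    rng = rng or random.Random()
    data = []
    for _ in range(n):
        y1 = rng.uniform(-1.0, 1.0)
        mean = sum(c * y1**k for k, c in enumerate(coeffs))
        data.append((y1, rng.gauss(mean, sigma)))
    return data

def simulate_constant(c0, sigma, n=12, rng=None):
    """The simpler process: the same construction with only a constant term."""
    return simulate_polynomial([c0], sigma, n, rng)

# Placeholder parameter values, assumed for illustration only.
rng = random.Random(1)
complex_data = simulate_polynomial([0.0, 1.0, 0.0, -2.0, 1.5], 0.25, rng=rng)
simple_data = simulate_constant(0.5, 0.25, rng=rng)
```

Writing the constant process as a special case of the polynomial one makes the nesting of the two small worlds explicit.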
[Figure 4 panels: (a) Latent Data Distribution; (b) Maximum Likelihood Predictive Distribution; (c) Posterior Predictive Distribution]

Figure 4: Misfit occurs when an inferential model cannot capture the complexity of the
latent data generating process, for example when models assuming the simple data generating
process (2) are fit to (a) data generated according to the more complex process
(1). Because the assumptions are too restrictive, the resulting (b) maximum likelihood
predictive distribution and (c) posterior predictive distribution are poor approximations to
the latent data generating process. Compare to Figure 2.


Misfit occurs when the measurement is generated from the more complex process (1) (Figure 4a)
but the inferential models assume the simpler process (2); in this case the resulting
predictive distributions (Figure 4b, c) will never be able to capture the latent data generating
process. On the other hand, overfit occurs when the measurement is generated
from the simpler process (Figure 5a) but the inferential models assume the more complex
process. The resulting predictive distributions (Figure 5b, c) will overfit to the Gaussian
noise, inducing a bias away from the latent data generating process. In both cases the
Bayesian analysis uses the conjugate prior

$$\pi_X(c, \sigma) = \mathrm{MultiNormalGamma}(\mu_0, \Lambda_0, \alpha_0, \beta_0)$$

with

$$\mu_0 = 0, \qquad \Lambda_0 = 0.001 \cdot I, \qquad \alpha_0 = 0.5, \qquad \beta_0 = 0.5.$$

Reliable inferences consequently require some predictive validation to ensure that, even 
if the model misfits or overfits, the resulting predictive distribution approximates the latent
data generating process sufficiently well. One immediate strategy is to test the model 
within a null hypothesis significance testing framework, rejecting if the measured data is 
sufficiently unlikely with respect to the inferred predictive distribution. By construction, 
however, we make no attempt to model anything outside of the small world, let alone 
its entire complement, which prevents us from constructing a valid alternative hypothesis 






[Figure 5 panels, each with axis label $y_1$: (a) Latent Data Distribution; (b) Maximum Likelihood Predictive Distribution; (c) Posterior Predictive Distribution]

Figure 5: Overfitting occurs when an inferential model has too much flexibility relative
to the latent data generating process, for example when models assuming a complex data
generating process (1) are fit to (a) data generated according to the simpler data generating
process (2). Both the (b) maximum likelihood predictive distribution and (c) posterior
predictive distribution recklessly fit to the Gaussian noise in the data, biasing predictions
away from the latent data generating process. Compare to Figure 3.


needed to calibrate such tests. In order to quantify predictive performance without looking 
outside of the small world we need to compare the predictive distribution to the latent data 
generating process directly. 

[7] considered many possible strategies for comparing predictive distributions to the 
latent data generating process, but almost all of them require endowing the small world 
with additional structure, such as a metric or a distinguished test function, that limits the 
ultimate scope of the validation. Only one of the approaches considered arises canonically 
from the general construction of inference - the Kullback-Leibler divergence [8]. In this 
section I discuss how the Kullback-Leibler divergence defines a measure of relative predictive
performance, although one that cannot be computed in practice. I then consider a
manipulation of the Kullback-Leibler divergence that defines scores of relative predictive
performance that are amenable to approximations, and finally I show how various approximation
strategies give rise to many of the model comparison techniques already popular
in practice. 

3.1 Relative Predictive Performance Measures 

In order to construct a measure of predictive performance we have to compare some inferred
predictive distribution, $\pi_{Y|y}$, to the latent data generating process, $\pi$. Without endowing
the measurement space with any particular structure, the only canonical way of comparing
two distributions, $\mu$ and $\nu$, on $Y$ is with an $f$-divergence [9],

$$D_f(\mu \,\|\, \nu) = \int_Y \nu(dy) \, f\!\left( \frac{d\mu}{d\nu}(y) \right),$$

where $f : \mathbb{R} \to \mathbb{R}$ is any convex function satisfying $f(1) = 0$. Moreover, the only $f$-divergence
that respects any product structure of the measurement space and allows us to
marginalize out irrelevant structure as necessary is the Kullback-Leibler divergence,

$$\mathrm{KL}(\mu \,\|\, \nu) = -\int_Y \mu(dy) \, \log \frac{d\nu}{d\mu}(y).$$

The Kullback-Leibler divergence vanishes only when the two measures are equal and
monotonically increases as the two measures deviate, approaching infinity when $\nu$ is not
absolutely continuous with respect to $\mu$.

Because the Kullback-Leibler divergence is not symmetric there are two possible ways
that we might use it to compare the latent data generating process and an inferred
predictive distribution. Using the inferred predictive distribution as the base measure,
$\mathrm{KL}(\pi_{Y|y} \,\|\, \pi)$, considers the predictive performance only where $\pi_{Y|y}$ concentrates and consequently
does not penalize predictive distributions that completely ignore neighborhoods
supported by $\pi$. In an extreme limit, this divergence does not even penalize predictive
distributions that are not absolutely continuous with respect to the latent data generating
process. In order to truly assess the inferred predictive distribution we need to instead
base the divergence on the latent data generating process itself,

$$\mathrm{KL}(\pi \,\|\, \pi_{Y|y}) = -\int_Y \pi(dy) \, \log \frac{d\pi_{Y|y}}{d\pi}(y).$$

Here we will use this form of the Kullback-Leibler divergence to quantify the validity of our
modeling assumptions, but it can also be used to construct a more elaborate sensitivity
analysis of those assumptions [10].
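The asymmetry is easy to see numerically. The sketch below assumes a finite measurement space with two outcomes, so that measures are lists of probabilities, and takes the first argument of the divergence as the base measure; the function name and the two example distributions are hypothetical.

```python
import math

def kl(base, other):
    """Kullback-Leibler divergence on a finite space, weighted by the base measure."""
    total = 0.0
    for b, o in zip(base, other):
        if b == 0.0:
            continue  # outcomes the base measure ignores contribute nothing
        if o == 0.0:
            return math.inf  # the other measure misses an outcome the base supports
        total -= b * math.log(o / b)
    return total

latent = [0.5, 0.5]      # hypothetical latent data generating process
predictive = [1.0, 0.0]  # predictive distribution that ignores the second outcome

based_on_predictive = kl(predictive, latent)
based_on_latent = kl(latent, predictive)
```

With the predictive distribution as the base the ignored outcome costs only log 2, while basing the divergence on the latent process makes the same omission infinitely costly.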

As with null hypothesis significance testing, the Kullback-Leibler divergence cannot be
calibrated; in other words, there is no canonical threshold below which we can declare our
model assumptions valid. Unlike hypothesis testing, however, the difference between two
divergences is meaningful, allowing us to quantify the relative performance of $\pi_{Y|y}$ compared
to some other predictive distribution. Although $\mathrm{KL}(\pi \,\|\, \pi_{Y|y})$ cannot be computed
without assuming a priori knowledge of the true data generation process, we can manipulate
the divergence into a more advantageous form without compromising its quantification
of relative performance.

Let $\lambda$ be any reference measure with respect to which both the true data generating
process and the inferred predictive distribution are absolutely continuous. We can then
define a relative predictive performance score as

$$\delta(\pi \,\|\, \pi_{Y|y}) = -\int_Y \pi(dy) \, \log \frac{d\pi_{Y|y}}{d\lambda}(y). \qquad (3)$$

The difference between any relative predictive performance scores is the same as the difference
of the equivalent Kullback-Leibler divergences and so they quantify the same relative
performance, but because the densities $d\pi_{Y|y}/d\lambda$ are independent of the latent data generating
process these relative scores can be approximated using only sampled measurements
from $\pi$. Relative predictive performance scores also have the welcome interpretation as
expected logarithmic score functions [mis].

The ultimate utility of these relative predictive performance scores then depends on
the accuracy and precision of the chosen approximation strategy.
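The claim that score differences equal divergence differences can be checked in a few lines. The derivation below is a sketch under stated assumptions: the score is taken to be the expected negative log density of the predictive distribution with respect to the reference measure, all Radon-Nikodym derivatives involved are assumed to exist, the final integral is assumed finite, and $\pi'_{Y|y}$ is a hypothetical second predictive distribution introduced only for the comparison.

```latex
\begin{align*}
\mathrm{KL}(\pi \,\|\, \pi_{Y|y})
  &= -\int_Y \pi(dy) \, \log \frac{d\pi_{Y|y}}{d\pi}(y) \\
  &= -\int_Y \pi(dy) \, \log \frac{d\pi_{Y|y}}{d\lambda}(y)
     + \int_Y \pi(dy) \, \log \frac{d\pi}{d\lambda}(y) \\
  &= \delta(\pi \,\|\, \pi_{Y|y})
     + \int_Y \pi(dy) \, \log \frac{d\pi}{d\lambda}(y).
\end{align*}
% The last integral does not depend on the predictive distribution, so
\[
\mathrm{KL}(\pi \,\|\, \pi_{Y|y}) - \mathrm{KL}(\pi \,\|\, \pi'_{Y|y})
  = \delta(\pi \,\|\, \pi_{Y|y}) - \delta(\pi \,\|\, \pi'_{Y|y}).
\]
```

The second line uses the chain rule for densities, and the term that is dropped in the difference is the only place where the unknown density of the latent process enters.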

3.2 Approximating Relative Predictive Performance Scores 

Although relative predictive performance scores cannot be calculated exactly, their construction
makes them amenable to a variety of approximations. For example, given an
ensemble of $N + 1$ measurements we could construct a Monte Carlo estimator,

$$\delta(\pi \,\|\, \pi_{Y|y}) \approx -\frac{1}{N} \sum_{n=1}^{N} \log \frac{d\pi_{Y|y}}{d\lambda}(y_n),$$

with vanishing bias and quantifiable variance. Unfortunately, in practice we rarely have an
ensemble of measurements and instead have to consider approximations that utilize only a
single measurement.
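A minimal sketch of such a Monte Carlo estimator follows. It assumes a toy setting in which the latent process is a standard Gaussian and the predictive distribution is a fixed Gaussian, so that the exact score is available for comparison; the function names, seed, and parameter values are assumptions made for this example.

```python
import math
import random

def log_normal_density(y, mu, sigma):
    """Log density of N(mu, sigma^2) with respect to Lebesgue measure."""
    return -0.5 * math.log(2.0 * math.pi * sigma**2) - 0.5 * ((y - mu) / sigma) ** 2

def monte_carlo_score(samples, mu, sigma):
    """Average negative log predictive density over samples from the latent process."""
    return -sum(log_normal_density(y, mu, sigma) for y in samples) / len(samples)

def exact_score(mu, sigma):
    """Exact score when the latent process is N(0, 1): E[(y - mu)^2] = 1 + mu^2."""
    return 0.5 * math.log(2.0 * math.pi * sigma**2) + (1.0 + mu**2) / (2.0 * sigma**2)

rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(20000)]
estimate = monte_carlo_score(samples, mu=0.5, sigma=1.5)
```

Comparing `estimate` with `exact_score(0.5, 1.5)` shows the estimator converging, and `exact_score(0.0, 1.0)` is smaller than either, as expected when the predictive distribution matches the latent process.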

3.2.1 Delta Estimators 

An immediate approximation of relative predictive performance scores derives from making
a delta approximation of the latent data generating process, $\pi \approx \delta_y$, which gives

$$\hat{\delta}(\pi \,\|\, \pi_{Y|y}) = -\log \frac{d\pi_{Y|y}}{d\lambda}(y).$$

Using the same measurement to learn the model and then validate it introduces a 
bias that makes delta estimators susceptible to overfitting. Moreover, the underlying delta 
approximation typically induces a large variance in the estimator, making fine comparisons 
between models difficult if not impossible. 
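The optimism of reusing the measurement can be seen in a small simulation. The sketch below assumes a latent standard Gaussian, a small world of unit-variance Gaussians with unknown mean fit by maximum likelihood, and five observations per measurement; these choices, the seed, and the function names are illustrative assumptions.

```python
import math
import random

def log_normal_density(y, mu):
    """Log density of N(mu, 1)."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * (y - mu) ** 2

def delta_estimate(ys, mu_hat):
    """Delta estimator: score the fitted model on the data used to fit it."""
    return -sum(log_normal_density(y, mu_hat) for y in ys)

def exact_score(n, mu_hat):
    """Exact score for n new observations from N(0, 1) under the fitted model."""
    return n * (0.5 * math.log(2.0 * math.pi) + 0.5 * (1.0 + mu_hat**2))

rng = random.Random(1)
n, reps = 5, 4000
gaps = []
for _ in range(reps):
    ys = [rng.gauss(0.0, 1.0) for _ in range(n)]
    mu_hat = sum(ys) / n  # maximum likelihood estimate of the mean
    gaps.append(exact_score(n, mu_hat) - delta_estimate(ys, mu_hat))
mean_gap = sum(gaps) / reps
```

In this setting the expected gap works out analytically to one, the number of fitted parameters, which is the kind of correction a complexity penalty is meant to supply; the individual gaps also vary widely from one simulated measurement to the next.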

3.2.2 Hold-out Estimators 

More sophisticated estimates of relative predictive performance scores can be constructed 
by using the given measurement to simulate an ensemble of measurements. 

Assuming that the measurement space has a product structure, $Y = \prod_{n=1}^{N} Y_n$, any
measurement can be partitioned into two subsets of size $N_1$ and $N_2$. Hold-out estimators use
one of these partitions, often denoted the training data, to infer a predictive distribution
and the remaining partition, denoted the validation data, to construct a delta estimator,

$$\hat{\delta}(\pi \,\|\, \pi_{Y|y}) = -\frac{N}{N_2} \log \frac{d\pi_{Y|y_1}}{d\lambda}(y_2).$$

The simulated partitions in the validation data not only promise a more precise estimate
but also admit the estimation of the estimator variance using the Monte Carlo standard
error. This, however, comes with the assumption that the naive scaling of the predictive density
inferred from the training data is a reasonable approximation to the predictive density
inferred from the full measurement. When data are sparse relative to the model complexity
this assumption can severely bias the estimator; for example, predictive distributions
inferred from small partitions are more susceptible to overfitting, artificially penalizing the
predictive performance of the model.

Moreover, the product structure of the measurement space necessary to construct hold-out
estimators precludes many structured measurements, such as those arising from some
hierarchical models, networks, and time series. 
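A hold-out estimator can be sketched in a few lines. The example assumes independent observations, a small world of unit-variance Gaussians with the mean fit by maximum likelihood on the training partition, and a naive rescaling of the validation log density by the ratio of the full size to the validation size; the function names and data are hypothetical.

```python
import math

def log_normal_density(y, mu):
    """Log density of N(mu, 1)."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * (y - mu) ** 2

def holdout_score(ys, n_train):
    """Fit on the first n_train points, score the rest, rescale to the full size."""
    train, valid = ys[:n_train], ys[n_train:]
    mu_hat = sum(train) / len(train)  # maximum likelihood fit on the training data
    log_pred = sum(log_normal_density(y, mu_hat) for y in valid)
    return -(len(ys) / len(valid)) * log_pred

estimate = holdout_score([0.0, 2.0, 1.0, 1.0], n_train=2)
```

Because each validation point contributes its own term to `log_pred`, the spread of those terms could also be used to gauge the variance of the estimate.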


3.2.3 Jackknife Estimators 


In order to compensate for some of the potential bias in hold-out estimators we can
appeal to a jackknife estimator [12] which averages over the possible assignments of training
and validation data. Partitioning the measurement into $K$ subsets of size $M = N/K$, the
jackknife estimator is given by

$$\hat{\delta}(\pi \,\|\, \pi_{Y|y}) = -\frac{1}{K} \sum_{k=1}^{K} \frac{N}{M} \log \frac{d\pi_{Y|y \setminus y_k}}{d\lambda}(y_k) = -\sum_{k=1}^{K} \log \frac{d\pi_{Y|y \setminus y_k}}{d\lambda}(y_k),$$

where $y \setminus y_k$ are often denoted the $k$th training data and $y_k$ the $k$th testing data. This approach
can also be readily generalized to a bootstrap estimator m which samples training
and validation data with replacement.

The averaging over partitions typically reduces the bias of the relative predictive performance
score estimation but the variance can still be quite large. Moreover, the K fits
required to construct the jackknife estimator can be prohibitively expensive in practice. 
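The averaging over partitions can be sketched as follows. The example assumes independent observations, contiguous folds of equal size, and a small world of unit-variance Gaussians whose mean is refit by maximum likelihood for every fold; the function names and data are hypothetical.

```python
import math

def log_normal_density(y, mu):
    """Log density of N(mu, 1)."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * (y - mu) ** 2

def kfold_score(ys, k):
    """Average K rescaled hold-out estimates, refitting the model for each fold."""
    n = len(ys)
    if n % k != 0:
        raise ValueError("this sketch assumes K divides the number of observations")
    m = n // k
    total = 0.0
    for fold in range(k):
        test = ys[fold * m:(fold + 1) * m]
        train = ys[:fold * m] + ys[(fold + 1) * m:]
        mu_hat = sum(train) / len(train)  # one model fit per fold
        total += -(n / m) * sum(log_normal_density(y, mu_hat) for y in test)
    return total / k

estimate = kfold_score([0.0, 2.0, 1.0, 1.0], k=2)
```

The loop makes the computational cost explicit: every fold requires its own fit, which is cheap here but can dominate for models that are expensive to fit.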
3.3 Constructing Relative Predictive Performance Measures

When we apply these approximation strategies to the predictive distributions arising in
frequentist and Bayesian inference we immediately recover many of the comparative methods
that have arisen and proved empirically effective in statistical practice. In this section
I detail many of these methods to emphasize the unifying nature of this foundational
perspective.

3.3.1 Comparing Likelihoods 

In frequentist methods the inferred predictive distribution is given by a single element in 
the small world or, equivalently, evaluating the likelihood at a single point. 

Explicit use of delta estimators of likelihood-based relative predictive performance
scores provides a formal justification of the visual residual analysis HUES] ubiquitous in
not only statistics but also the physical sciences. Moreover, when augmented with an appropriate
complexity penalty the reuse estimator reduces to the Akaike Information
Criterion m- 

Hold-out and jackknife estimators of likelihood-based relative predictive performance
scores immediately yield predictive log loss hold-out validation and cross validation, respectively,
which have become almost fundamental principles in the practice of modern
machine learning [13IH].

The potential pathologies of these approximations manifest even in the simple misfit
and overfit examples introduced above. I generate an ensemble of data from the latent data
generating process and compare the exact likelihood predictive performance score based on
a reference Lebesgue measure, $\delta$, to each estimate, $\hat{\delta}$ (Figures 6, 7). The partitions for the
hold-out estimators consisted of six data each, the minimum required for finite maximum
likelihood estimates, while the $K = 6$ jackknife partitions each consisted of $N - M = 10$
training data and $M = 2$ testing data.

In both cases the estimators are noisy with a substantial bias, with the hold-out estimator
particularly sensitive to overfitting as expected. Although these errors may partially
cancel when comparing models, any significant cancellation would be rather serendipitous. 

3.3.2 Comparing Prior Predictive Distributions 

Box [19] [20] was a strong proponent of the predictive validation of Bayesian methods, in
particular the use of the prior predictive distribution. 

One benefit of the prior predictive distribution is that, because it doesn’t depend on 
the measurement, the delta estimator is unbiased. In fact, the delta estimate 
[Figure 6 panels: (a) and (b), titled Likelihood Predictive Performance (Misfit), each comparing the Delta, Hold-out, and Jackknife estimators]

Figure 6: Even in the simple misfit model, approximations of likelihood-based relative
predictive performance scores can leave much to be desired, as demonstrated by the 20%,
50%, and 80% quantiles of the estimator error over an ensemble of measurements from the
latent data generating process.

Figure 7: The simple overfit model also exposes the weakness of approximations of
likelihood-based relative predictive performance scores, as
demonstrated by the 20%, 50%, and 80% quantiles of the estimator error over an ensemble
of measurements from the latent data generating process. Hold-out estimators are
particularly poor given how sensitive the hold-out fits are to overfitting.

Figure 8: Approximations of prior predictive-based relative predictive performance scores
for the misfit example are not terrible, as demonstrated by the 20%, 50%, and 80% quantiles of the
estimator error over an ensemble of measurements from the latent data generating process,
but nowhere near precise enough to compare models with similar predictive performance.


is exactly the logarithm of the marginal likelihood, or evidence, used in Bayesian model
comparison [211E]) and the difference of estimates between two models is exactly the log-odds
ratio. Consequently classical Bayesian model comparison also has an interpretation
in terms of predictive performance.

That said, the utility of this relative predictive performance score is limited both by the 
large variance of the estimator and a potential overfitting bias if the prior is modified during 
inference. In models where the prior is strongly constrained by previous measurements or 
theoretical conditions this bias may be less of an issue, but care should always be taken. 
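The connection to the evidence can be illustrated with a conjugate model where the marginal likelihood is available in closed form. The sketch below assumes an unknown Gaussian mean with a Gaussian prior and known noise scale, and accumulates the log evidence one observation at a time; the model, the function name, and the data are assumptions made for this example.

```python
import math

def log_evidence(ys, prior_mean, prior_sd, noise_sd):
    """Log marginal likelihood, built from sequential one-step-ahead predictions."""
    mean, var = prior_mean, prior_sd**2
    total = 0.0
    for y in ys:
        pred_var = var + noise_sd**2  # predictive variance for the next point
        total += -0.5 * math.log(2.0 * math.pi * pred_var) - 0.5 * (y - mean) ** 2 / pred_var
        gain = var / pred_var         # conjugate update of the mean's distribution
        mean = mean + gain * (y - mean)
        var = (1.0 - gain) * var
    return total

ys = [1.0, 2.0, 3.0]  # hypothetical measurement
# Compare two hypothetical priors through the difference of their log evidences.
log_ratio = log_evidence(ys, 0.0, 1.0, 1.0) - log_evidence(ys, 0.0, 10.0, 1.0)
```

The negative of `log_evidence` is the delta estimate of the prior predictive score in this toy model, and `log_ratio` is the corresponding comparison between the two priors.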

As in Section 3.3.1, the simple misfit and overfit examples demonstrate the limitations
of each estimator (Figures 8, 9).

3.3.3 Comparing Posterior Predictive Distributions 

Alternatively, we can construct relative predictive performance scores in the Bayesian 
paradigm by using the posterior predictive distribution. 

Similar to the use of likelihoods, relative predictive performance scores constructed
from reuse estimators provide motivation for many visual diagnostics such as Bayesian
residual analyses [22] and, in particular, posterior-predictive checks [23] [2l]. Likewise,
the use of hold-out and jackknife estimators yields posterior predictive hold-out and cross
validation [25] [26], which continue to grow in popularity in the machine learning and
statistics literature. Consideration of the examples of Sections 3.3.1 and 3.3.2 emphasizes
the continued need to maintain vigilance in these applications (Figures 10, 11).

Figure 9: The bootstrap estimator of prior predictive-based relative predictive performance
scores is particularly sensitive to overfitting, as seen in the 20%, 50%, and 80% quantiles
of the estimator error over an ensemble of measurements from the latent data generating
process.

Figure 10: Approximations of posterior predictive-based relative predictive performance
scores on the misfit example perform similarly to the approximations from other predictive
distributions, once again demonstrated by the 20%, 50%, and 80% quantiles of the
estimator error over an ensemble of measurements from the latent data generating process.

Figure 11: Approximations of posterior predictive-based relative predictive performance
scores on the overfit example perform similarly to the approximations from other predictive
distributions, once again demonstrated by the 20%, 50%, and 80% quantiles of the
estimator error over an ensemble of measurements from the latent data generating process.

Posterior predictive cross validation also provides the basis for unifying many of the information
criteria that have been developed in the Bayesian literature. [3127!, for example,
show how the Widely Applicable Information Criterion [28],

$$\mathrm{WAIC} \propto \sum_{n=1}^{N} \log \mathbb{E}_{\pi_{X|y}}\!\left[ \pi_{Y|X}(y_n \mid x) \right] - \sum_{n=1}^{N} \mathrm{Var}_{\pi_{X|y}}\!\left[ \log \pi_{Y|X}(y_n \mid x) \right],$$

can be derived as an approximation of the posterior predictive relative predictive performance
score. Moreover, given a point estimate, $\hat{x}$, that singles out one element of the small
world, the Widely Applicable Information Criterion reduces to the Deviance Information
Criterion [29],

$$\mathrm{DIC} \propto \sum_{n=1}^{N} \log \pi_{Y|X}(y_n \mid \hat{x}) - 2 \sum_{n=1}^{N} \left( \log \pi_{Y|X}(y_n \mid \hat{x}) - \mathbb{E}_{\pi_{X|y}}\!\left[ \log \pi_{Y|X}(y_n \mid x) \right] \right).$$

As with the reuse, hold-out, and jackknife estimators, the ultimate accuracy and precision 
of these information criteria relative to the true posterior predictive relative predictive 
performance score is paramount in any practical application.
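In practice such criteria are usually computed from posterior draws. The sketch below assumes a unit-variance Gaussian likelihood with unknown mean and takes a list of posterior draws of that mean as given; it forms the log pointwise predictive density minus the pointwise posterior variance of the log likelihood. The function names and inputs are hypothetical, and the overall scaling is left unspecified.

```python
import math
import statistics

def log_normal_density(y, mu):
    """Log density of N(mu, 1)."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * (y - mu) ** 2

def log_mean_exp(values):
    """Numerically stable log of the average of exp(values)."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values) / len(values))

def waic(ys, posterior_draws):
    """Log pointwise predictive density minus the posterior variance penalty."""
    lpd, penalty = 0.0, 0.0
    for y in ys:
        log_liks = [log_normal_density(y, x) for x in posterior_draws]
        lpd += log_mean_exp(log_liks)               # log of posterior-averaged likelihood
        penalty += statistics.variance(log_liks)    # variance of the log likelihood
    return lpd - penalty

concentrated = waic([0.0, 1.0], [0.5, 0.5])
diffuse = waic([0.0, 1.0], [0.0, 1.0])
```

When all draws coincide the penalty vanishes and the criterion reduces to the log likelihood at that point, while more diffuse draws are penalized through the variance term.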

4 Conclusion 

In this paper I have shown how the Kullback-Leibler divergence between an inferred predictive
distribution and the latent data generating process defines a canonical but incomputable
measure of relative predictive performance. Moreover, I have demonstrated how it
can be simplified into relative predictive performance scores which quantify the same relative
predictive performance while also being more amenable to practical approximations.
Applying various approximation strategies to the relative predictive performance derived 
from predictive distributions in frequentist and Bayesian inference yields many of the model 
comparison techniques ubiquitous in practice, from predictive log loss cross validation to 
the Bayesian evidence and Bayesian information criteria. 

The main benefit in unifying all of these existing methods into a single foundational 
perspective is that it provides a common framework for understanding the limitations of 
these methods and how they can be used responsibly. In particular, it emphasizes that 
these existing methods are all estimates, with uncertain variances and biases that make 
quantitative statements about relative predictive performance, and a hard selection of one
model above all others under consideration, somewhat precarious. 

This difficulty in making quantitative statements has motivated new approaches to 
model comparison that do not fit each model in isolation but rather fit them as components
of a single, comprehensive model [301 El]. Such an inclusive strategy offers unique
computational benefits and an intriguing new interpretation of the small world, and it 
promises to be an exciting area of future research. 